Abstract

Wearable cameras, such as Google Glass and GoPro, enable video data
collection over larger areas and from different views. In this paper,
we tackle a new problem of locating the co-interest person (CIP), i.e.,
the one who draws attention from most camera wearers, from temporally
synchronized videos taken by multiple wearable cameras. Our basic idea
is to exploit the motion patterns of people and use them to correlate
the persons across different videos, instead of performing
appearance-based matching as in traditional video
co-segmentation/localization. This way, we can identify the CIP even if
a group of people with similar appearance are present in the view. More
specifically, we detect a set of persons on each frame as the
candidates of the CIP and then build a Conditional Random Field (CRF)
model to select the one with consistent motion patterns in different
videos and high spatial-temporal consistency in each video. We collect
three sets of wearable-camera videos for testing the proposed
algorithm. All the involved people have similar appearances in the
collected videos and the experiments demonstrate the effectiveness of
the proposed algorithm.

1. Introduction 

Video-based individual, interactive, and group activity recognition has
attracted increasing interest in the computer vision community. Fixed
cameras can cover only very limited areas. This problem gets even worse
when recognizing activities in a social event, such as a concert,
ceremony or party, where multiple people are present and move from time
to time. Recently, wearable cameras, such as Google Glass or GoPro,
provide a new solution, where all or part of the involved persons wear
a camera over the head to record what they see over time [7, 19].

By combining the temporally synchronized videos from different wearers,
we can recognize activities occurring in a large area, because camera
wearers can walk or move the head to follow the people or event of
interest [32]. An important problem arising from this setting is to
identify the co-interest person (CIP) that attracts the attention of
multiple wearers, since this person usually plays a central role in the
ongoing event of interest. The CIP and his/her activities are of
particular importance for surveillance, anomaly detection and social
network construction. For example, in a public scenario such as an
airport, the CIP can be a person with abnormal behavior or activity who
draws attention from multiple camera-wearing security guards, and the
quick detection of such CIPs can promote public security. In a
kindergarten, the CIP may be a kid with strange behavior that
continuously draws joint attention from camera-wearing teachers or
other kids. In this case, CIP detection can facilitate the early
finding of various child development issues. In a group discussion,
people usually focus on the person who leads or gives the speech at any
time, and the identification of such CIPs over time can help summarize
and edit all the videos from the attendees' cameras for more effective
information management and retrieval. In this paper, we develop a new
approach to detect CIPs from multiple videos taken by wearable cameras.
Figure 1. An illustration of the basic idea underlying the proposed
CIP detection approach. (a) A co-interest person (in red boxes)
identified in the same frame across different videos: a CIP always
shows consistent 3D motion patterns across all the videos in which
he/she is present. (b) A co-interest person (in red boxes) identified
along the same video (Frames 1 to 6): a CIP usually shows high
spatial-temporal consistency along a video. Our proposed algorithm
considers the consistency in both (a) and (b) for CIP detection. Note
that Video 3 in (a) is an egocentric video of the CIP.

In many social events, attendees may wear clothes with similar color
and texture, such as uniforms at work and suits at a formal dinner. In
these cases, it is very difficult to identify CIPs by performing
appearance matching across multiple videos, as shown in Fig. 1(a). In
this paper, we identify the CIP based on his/her motion patterns: it is
unlikely that two persons in the view keep showing exactly the same
motion over time.

However, it is very challenging to identify the person with the same
motion pattern from different videos, even if these videos are
temporally synchronized, because the motion pattern of a person is
defined in 3D and can only be partially reflected in each 2D video. In
practice, the 3D motion of the same person may be projected to
completely different 2D motions in different videos, as illustrated in
Fig. 1(a). In addition, in this research, the inference of the 2D
motion pattern of a person is further complicated by the use of
wearable cameras: camera motion and person motion are mixed in
generating each video.

In this paper, we address this challenging problem by combining the
temporally synchronized frames from different videos using a
Conditional Random Field (CRF) model. We first perform human detection
to obtain a set of candidates of the CIP. Then we build a CRF by taking
each frame as a node and the candidates on that frame as its states. In
this CRF, we define an inter-video energy that reflects the
motion-pattern difference of the candidates drawn from different
videos, as illustrated in Fig. 1(a). In particular, we use histograms
of optical flow (HOF), Hankelets [16] and movement pattern histograms
(MPH) [4] to describe the human motion. We also include an intra-video
energy term in the CRF to measure the location and size consistency of
candidates across frames of the same video, as illustrated in
Fig. 1(b). The minimization of the proposed CRF energy generates a CIP
on each frame of each video that shows both inter-video and intra-video
properties. To handle the case where a frame contains no CIP, e.g., the
CIP cannot see himself in his egocentric video, as shown by Video 3 in
Fig. 1(a), we also introduce an idle state in each frame.

2. Related Work 

2.1. Video co-segmentation 

Related to this paper is a series of prior research on video
co-segmentation, where common objects are segmented from multiple
videos. Video co-segmentation can be treated as an extension of the
long-studied image co-segmentation [5, 11, 12, 14, 17, 18, 22, 25, 26,
30], where the input is a set of images instead of videos.

However, different from the proposed CIP detection, the multiple videos
used for video co-segmentation are usually not temporally synchronized:
they may record the same object at different times. As a result, the
co-segmented person may not show motion consistency across different
videos. In practice, almost all the existing co-segmentation algorithms
are based on object-appearance matching. For example, [2] and [23]
model the co-segmentation as a foreground/background separation problem
based on the appearance information. Wang et al. [29] develop an
appearance-based weakly supervised co-segmentation algorithm which also
needs the labels for a few frames. In [13], the common objects are
localized in different videos by using the appearance and local
features.

Some prior video co-segmentation methods use the motion information to
help track and/or segment the objects in each video, but not to
establish correspondences between objects across videos as in the
proposed CIP detection. Chiu and Fritz [3] propose a multi-class
co-segmentation algorithm based on a non-parametric Bayesian model
which uses the motion information for object segmentation. In [31], a
number of tracklets are detected inside each video and the appearance
and shape information along the tracklets are then extracted to
identify the common target in multiple videos. In [8], co-segmentation
is formulated as a co-selection graph where motions are estimated to
measure the spatial-temporal consistency. In [10], motion trajectories
are detected to match the action across video pairs. However, the
action matching is only at the high level of the action type. There is
no frame-by-frame motion consistency between these videos since they
are not temporally synchronized.

In addition, when multiple people are present in the view of each
video, most works on video co-segmentation identify all of them as a
common object - person. In the proposed CIP detection, we need to
distinguish them and identify one person with presence in all or most
of the videos.

2.2. Gaze concurrences 

Also related to our work is the research on gaze concurrences of
multiple video takers. Robertson and Reid [21] estimate face
orientation by learning 2D face features from different views. In [24],
the points of interest are estimated in a crowded scene. However, these
methods rely on video data captured from a third person. As a result,
the area covered by these videos is quite limited and the accuracy of
head pose estimation degrades when the distance to the camera increases
[20]. Park et al. present an algorithm to locate gaze concurrences
directly from videos taken by head-mounted cameras. However, this
algorithm requires a prior scanning of the area of interest (for
example, a room or an auditorium) to reconstruct the reference
structure. This may not be available in practice.

3. Proposed Method 

To detect the CIP over time, we record a set of $N$ temporally
synchronized long-streaming videos that are taken by $N$ wearable
cameras over time $[0, \mathcal{T}]$. The CIP in these videos may
change over time. To simplify the problem, we first apply a sliding
window technique to divide the time $[0, \mathcal{T}]$ into overlapped
short time windows with length $T$. Over each short time window, we
assume that the CIP does not change in these $N$ videos and we propose
an algorithm to detect such a person in each window. The proposed
algorithm also provides an energy for the CIP detection in each window.
This energy value negatively reflects the confidence of the CIP
detection: the lower the energy, the higher the confidence. Finally, we
merge the CIP detection results over all the windows based on their
energies to achieve a CIP detection at each frame over
$[0, \mathcal{T}]$, as illustrated in Fig. 2.
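
The sliding-window merging described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Window` tuple layout, the name `merge_windows`, and the simplification that each window reports a single person identifier (rather than one bounding box per frame and video) are all assumptions made for the example.

```python
from typing import Hashable, List, Optional, Tuple

# Hypothetical layout: (start_frame, end_frame_exclusive, detected_person, energy)
Window = Tuple[int, int, Hashable, float]


def merge_windows(windows: List[Window], num_frames: int) -> List[Optional[Hashable]]:
    """For every frame, keep the detection of the covering window with the lowest energy."""
    best_energy = [float("inf")] * num_frames
    merged: List[Optional[Hashable]] = [None] * num_frames
    for start, end, person, energy in windows:
        # Clamp the window to the valid frame range before visiting its frames.
        for t in range(max(start, 0), min(end, num_frames)):
            if energy < best_energy[t]:
                best_energy[t] = energy
                merged[t] = person
    return merged


if __name__ == "__main__":
    windows = [(0, 4, "P1", 0.2), (2, 6, "P2", 0.5), (4, 8, "P2", 0.1)]
    print(merge_windows(windows, 8))
```

Frames that no window covers keep the value `None`, which makes gaps in the window layout visible instead of silently filling them.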


Figure 2. The framework of the proposed algorithm (panel label: CIP
detection in each window).

Figure 3. An example that illustrates the merging of the CIP detection
results.


To merge the CIP detection results from all the windows, we select, at
each frame, the detection with the lowest energy. Specifically, because
the sliding windows partially overlap, each frame, say $t$, is covered
by multiple windows, say $W_1, W_2, \ldots, W_K$. In each window $W_k$,
the CIP detection algorithm (to be introduced in Section 3.1) generates
a CIP detection $P_k$ and an associated energy $E_k$. We find the one
with the lowest energy as

    $k^* = \arg\min_{k \in \{1, \ldots, K\}} E_k$    (1)

and set $P_{k^*}$ as the final CIP detection in this frame $t$.

An example is shown in Fig. 3. In this figure, $W_i$ denotes the
partially overlapped windows, and $P_i$ and $E_i$ denote the CIP
detected in each window $W_i$ and its energy, respectively. If
$P_1 = P_2 = P'$ and $P_3 = P_4 = P_5 = P_6 = P_7 = P''$, as shown in
Fig. 3, then the red dashed line indicates the time when the CIP
changes from $P'$ to $P''$.

In the following, we focus on developing the proposed CIP detection
algorithm in each window $W$.

3.1. CIP detection using a CRF model

Over a short-time window $W$, the $N$ input videos are cropped into $N$
synchronized short video clips
$\mathcal{F} = \{F^n \mid n = 1, 2, \ldots, N\}$ with
$F^n = \{F^n_t \mid t = 1, \ldots, T\}$, where $F^n_t$ is the $t$-th
frame in the $n$-th video clip.

As shown in Fig. 2, we first perform human detection on each frame and
take each detection as a CIP candidate. A conditional random field
(CRF) [6] is then constructed by treating each frame as a node and each
candidate on this frame as a state of this node. Using this CRF model,
our goal is to select a candidate on each frame as the detected CIP.
Specifically, the CIP detection
$H = \{h^n_t \mid n = 1, \ldots, N; t = 1, \ldots, T\}$ has a posterior
probability

    $p(H \mid \mathcal{F}) \propto \exp(-E(H \mid \mathcal{F}))$
    with $E(H \mid \mathcal{F}) = \sum_{n,m,t,\tau} E(h^n_t, h^m_\tau \mid \mathcal{F})$,    (2)

where $E(h^n_t, h^m_\tau \mid \mathcal{F})$ is the energy of matching
$h^n_t$ and $h^m_\tau$ as the same person and taking it as the CIP. In
the remainder of the paper, we simplify the notation of this pairwise
energy as $E(h^n_t, h^m_\tau)$ and the energy function
$E(H \mid \mathcal{F})$ as $E(H)$ when there is no ambiguity. This way,
the CIP detection in the short time window is reduced to the problem of
finding an optimal $H$ that minimizes the energy
$E(H \mid \mathcal{F})$.

The major problem to be solved here is the definition of the pairwise
energy $E(h^n_t, h^m_\tau)$, which should reflect the correspondence of
the CIP between a pair of frames drawn from $\mathcal{F}$. In this
paper, we consider two cases: 1) the two frames are from the same video
clip (intra-video), and 2) the two frames are from different video
clips (inter-video). For Case 1), the CIP in the same video clip shows
two typical properties: (i) its relative location in the frame does not
change much over time, because the camera wearer usually moves his
head/eyes to follow the CIP even if the CIP is moving; (ii) the size of
the CIP does not change much between neighboring frames. For Case 2),
we only consider the synchronized frame pairs from different video
clips. In this case, the detected CIP should show consistent 3D
motions.

In our CRF model, we define two different energies $E_1$ and $E_2$ for
the intra-video and inter-video frame pairs, respectively, as
illustrated in Fig. 4, and rewrite the energy function $E(H)$ in
Eq. (2) as

    $E(H) = \sum_{n,t,\tau \neq t} E_1(h^n_t, h^n_\tau) + \sum_{t,n,m \neq n} E_2(h^n_t, h^m_t)$.    (3)
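
To make the structure of this two-term energy concrete, the sketch below evaluates it for one labeling: intra-video terms over all frame pairs of a clip, inter-video terms over synchronized frames of different clips. The function name `crf_energy`, the callback signatures, and the 0-based indexing are assumptions of this example; the pairwise energies themselves are passed in as plain functions.

```python
from itertools import product
from typing import Callable, Dict, Tuple

Node = Tuple[int, int]  # (video index n, frame index t), both 0-based here


def crf_energy(
    labels: Dict[Node, int],
    num_videos: int,
    num_frames: int,
    intra: Callable[[int, int, int, int, int], float],
    inter: Callable[[int, int, int, int, int], float],
) -> float:
    """Sum pairwise energies for one labeling (one chosen candidate per node)."""
    total = 0.0
    # Intra-video terms: all ordered frame pairs (t, tau), tau != t, inside each video n.
    for n in range(num_videos):
        for t, tau in product(range(num_frames), repeat=2):
            if t != tau:
                total += intra(n, t, labels[(n, t)], tau, labels[(n, tau)])
    # Inter-video terms: synchronized frames t of every ordered video pair (n, m), m != n.
    for t in range(num_frames):
        for n, m in product(range(num_videos), repeat=2):
            if n != m:
                total += inter(t, n, labels[(n, t)], m, labels[(m, t)])
    return total
```

Because the sums run over ordered pairs, every unordered pair contributes twice; this follows the index sets as written and only rescales the energy by a constant factor.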

Different from many previous works [6, 8], no unary energy term is
defined in this paper since we do not consider the candidate's
appearance information. The construction of $E_1$ and $E_2$ will be
elaborated in the following section.

3.2. Intra-Video Energy and Inter-Video Energy 

Intra-Video Energy. Ideally, a CIP that draws a camera wearer's
attention usually stays in the view center of the wearer. However, the
view center of the wearer may not be perfectly aligned with the center
of the camera he/she wears. Therefore, we do not consider center bias
in defining the intra-video energy in this work. Instead, the relative
location of the CIP usually does not change much in a short video clip
and we can penalize the location change between frames for CIP
detection. In addition, in a short video clip, the size of the CIP
should not change substantially. Considering these two properties, we
define the intra-video energy as

    $E_1(h^n_t, h^n_\tau) = \underbrace{\cdots}_{\text{location-change penalty on } \|c^n_\tau - c^n_t\|} + \delta(t, \tau - 1)\,\bigl(1 - (\|s^n_\tau - s^n_t\| + 1)^{-1}\bigr)$,    (4)

where $c^n_t$, $c^n_\tau$ ($s^n_t$, $s^n_\tau$) denote the center
(size) of the candidate in frame $t$ and $\tau$ in video $n$,
respectively. $\delta(x, y)$ is the indicator function that equals 1 if
$x = y$ and 0 otherwise. The inclusion of this indicator function
ensures that the penalty on the CIP size change is only defined for
adjacent frames.
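
A small sketch of an intra-video energy of this kind is given below. The size term follows the saturating form $1 - (\|\cdot\| + 1)^{-1}$ gated by the adjacent-frame indicator. The location term is an assumption of this example: it simply reuses the same saturating form on the distance between candidate centers, and representing the size as a (width, height) pair is likewise an illustrative choice.

```python
import math
from typing import Sequence


def saturating_penalty(diff_norm: float) -> float:
    """Map a non-negative difference to [0, 1): 0 for no change, approaching 1 for large change."""
    return 1.0 - 1.0 / (diff_norm + 1.0)


def intra_video_energy(
    center_t: Sequence[float],
    size_t: Sequence[float],
    center_tau: Sequence[float],
    size_tau: Sequence[float],
    t: int,
    tau: int,
) -> float:
    # Assumed form: location penalty between any two frames of the same video.
    location = saturating_penalty(math.dist(center_t, center_tau))
    # Size penalty only when tau is the frame right after t, i.e. delta(t, tau - 1) = 1.
    size = saturating_penalty(math.dist(size_t, size_tau)) if t == tau - 1 else 0.0
    return location + size
```

The saturating form keeps each term below 1, so a single large jump cannot dominate the total energy of a window.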

Inter-Video Energy. As mentioned above, the inter-video energy is based
on motion patterns of the CIP. In this paper, we extract the motion
patterns using two types of features: frame-based and trajectory-based.



Figure 5. Frame based motion feature extraction. 


The frame-based features are defined to measure the momentary motion of
the CIP using the information from a pair of neighboring frames.
Specifically, we calculate the optical flow using neighboring frames
[1]. To remove the influence of camera motion, we further calculate the
relative optical flow for each candidate by subtracting the average
optical flow in its surrounding region. An example is shown in the top
row of Fig. 5(b), where the red box indicates a candidate and the
region between the red box and its surrounding blue box is taken for
computing the average optical flow for subtraction.
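
The camera-motion compensation described above can be sketched with NumPy as follows. This is an illustrative example under stated assumptions: the flow field is a dense H x W x 2 array, the surrounding region is a ring of fixed pixel width `margin` around the candidate box (the source does not give the ring size), and the function name `relative_flow` is hypothetical.

```python
import numpy as np


def relative_flow(flow: np.ndarray, box, margin: int) -> np.ndarray:
    """Subtract the mean flow of the ring around `box` from the flow inside `box`.

    flow:   H x W x 2 array of (dx, dy) vectors.
    box:    (x0, y0, x1, y1) with exclusive x1, y1, in pixel coordinates.
    margin: width in pixels of the surrounding ring (an assumed parameter).
    """
    h, w = flow.shape[:2]
    x0, y0, x1, y1 = box
    # Outer box, clipped to the image.
    ox0, oy0 = max(x0 - margin, 0), max(y0 - margin, 0)
    ox1, oy1 = min(x1 + margin, w), min(y1 + margin, h)
    # Boolean mask that is True on the ring and False inside the candidate box.
    ring = np.ones((oy1 - oy0, ox1 - ox0), dtype=bool)
    ring[y0 - oy0:y1 - oy0, x0 - ox0:x1 - ox0] = False
    outer = flow[oy0:oy1, ox0:ox1]
    background = outer[ring].mean(axis=0) if ring.any() else np.zeros(2)
    return flow[y0:y1, x0:x1] - background
```

If the candidate box fills the whole image, the ring is empty and the flow is returned unchanged.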

In this paper, we assume all the videos are taken from a similar
altitude. This way, at a specific time, a 3D vertical motion of the CIP
should be projected to similar directions (up or down) in all the
cameras, but a 3D horizontal motion may be projected to opposite
directions in different cameras. For example, in Fig. 5(a), the same
hand motion is from right to left when viewed from the front, but from
left to right when viewed from the back. Therefore, in this paper we
propose to ignore the horizontal motion direction information in
constructing the frame-based features. Many previous works use a
histogram of optical flow (HOF) quantized at 8 directions: East (E),
West (W), North (N), South (S), North-East (NE), North-West (NW),
South-East (SE) and South-West (SW) as motion features. By ignoring the
horizontal motion directions, in this paper, we reduce these 8
directions to 5 by merging three histogram-bin pairs, i.e., merging NW
into NE, W into E, and SW into SE, each pair being mirror images about
the vertical axis, as shown in Fig. 5(b).
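
The folding of 8 direction bins into 5 can be illustrated as below. The sketch takes the absolute value of the horizontal component, so that W, NW and SW fall into E, NE and SE, and then quantizes the remaining half-plane. Weighting votes by flow magnitude, normalizing the histogram, the bin order in `FOLDED_BINS`, and the convention that `dy` is positive upward are assumptions of this example.

```python
import numpy as np

FOLDED_BINS = ("E", "NE", "N", "SE", "S")  # bin order used by this sketch


def folded_hof(flow: np.ndarray) -> np.ndarray:
    """5-bin histogram of optical flow that ignores the left/right direction.

    flow: array of shape (..., 2) holding (dx, dy) with dy positive upward.
    """
    dx = np.abs(flow[..., 0]).ravel()  # drop the horizontal sign: W->E, NW->NE, SW->SE
    dy = flow[..., 1].ravel()
    magnitude = np.hypot(dx, dy)
    angle = np.degrees(np.arctan2(dy, dx))  # lies in [-90, 90] because dx >= 0
    # Nearest of the centers -90 (S), -45 (SE), 0 (E), 45 (NE), 90 (N).
    nearest = np.rint((angle + 90.0) / 45.0).astype(int)
    to_folded = np.array([4, 3, 0, 1, 2])  # S, SE, E, NE, N -> index in FOLDED_BINS
    hist = np.bincount(to_folded[nearest], weights=magnitude, minlength=5)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

With this folding, a right-to-left motion seen from the front and the same motion seen from the back as left-to-right produce identical histograms.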

To construct the frame-based features for each CIP candidate on each
frame, we divide its bounding box along the vertical direction in a
pyramid style, as shown in Fig. 5(b). The bounding box is first
uniformly divided into two smaller boxes, each of which is then further
divided into two equal-size boxes. In our experiment, we perform 3
rounds of pyramid division and in total obtain 1 + 2 + 4 + 8 = 15 boxes
in 4 scales for each candidate. By computing and concatenating the
5-bin HOF (as mentioned above) for the original bounding box and the
subdivided boxes, we construct an HOF-based feature with a dimension of
5 x 15 = 75. Within each box (including the original bounding box and
its subdivided boxes), we further compute the average magnitudes of the
optical flow along the x and y directions, and the corresponding
standard deviations of these magnitudes along the x and y directions,
respectively, to construct a magnitude-based feature $f$ with a
dimension of 4 x 15 = 60. In this paper, the frame-based feature is
defined as the union of the HOF-based and the magnitude-based features.
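
The pyramid division and the resulting feature dimensions can be checked with a short sketch. It assumes that dividing "along the vertical direction" means cutting the box height into stacked strips of equal height; the function name `pyramid_boxes` and the (x0, y0, x1, y1) box layout are illustrative choices.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def pyramid_boxes(box: Box, rounds: int = 3) -> List[Box]:
    """Return the original box plus its 2, 4, ..., 2**rounds equal-height strips."""
    x0, y0, x1, y1 = box
    boxes: List[Box] = []
    for level in range(rounds + 1):
        parts = 2 ** level
        step = (y1 - y0) / parts
        for i in range(parts):
            boxes.append((x0, y0 + i * step, x1, y0 + (i + 1) * step))
    return boxes


if __name__ == "__main__":
    boxes = pyramid_boxes((0, 0, 40, 80))
    # 15 boxes -> 5 HOF bins each, and 4 magnitude statistics each.
    print(len(boxes), 5 * len(boxes), 4 * len(boxes))
```

Three rounds give 1 + 2 + 4 + 8 = 15 boxes, which matches the 75-dimensional HOF-based and 60-dimensional magnitude-based features.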

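In practice, the change of the camera angle usually results in the
change of the optical-flow magnitudes in $f$. Therefore, when comparing
frame-based features between two candidates, we use the L1 distance for
the HOF-based features and the correlation metric for the magnitude
features:

    $E_2^{\mathrm{frm}}(h^n_t, h^m_t) = 1 - \exp(-\|u^n_t - u^m_t\|_1) + \mathrm{corr}(f^n_t, f^m_t)$,    (5)

where $u^n_t$ and $f^n_t$ denote the HOF-based and magnitude-based
features of candidate $h^n_t$, respectively.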

In addition to the frame-based features, we also extract
trajectory-based features based on short tracklets to capture the
motion over a longer time. In this paper, we use Hankelets features and
Movement Pattern Histograms (MPH) features for this purpose since both
of them show good view-invariance property and have been successfully
used for cross-view action recognition [16, 4].

Figure 4. An illustration of the CRF construction for CIP detection.
Each column denotes one video and each row denotes the same frame from
different videos. We treat each frame as a node and the detected CIP
candidates on each frame as the states of the node. In this CRF, the
red lines indicate that the inter-video energies are defined over all
pairs of synchronized frames from different videos. The green lines
indicate that the location-change penalty term in the intra-video
energy is defined between each pair of frames inside a video, and the
purple lines indicate that the size penalty in the intra-video energy
is defined only between neighboring frames inside a video.


Tracklet. Starting from each candidate, we generate a tracklet with the
typical length of 15 frames. In this paper, we use a simple greedy
tracking strategy [31]: given a candidate in a frame, the candidate in
the next frame with the highest spatial overlap is taken and this
process is then repeated frame by frame to form the tracklet.
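
The greedy linking step can be sketched as follows. The example measures "spatial overlap" by intersection over union and stops the tracklet when no candidate in the next frame overlaps at all; both choices, as well as the names `iou` and `greedy_tracklet`, are assumptions of this illustration rather than details given in the text.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_tracklet(
    detections: Sequence[Sequence[Box]], start_frame: int, start_index: int, length: int = 15
) -> List[Box]:
    """Link a candidate to the most-overlapping candidate in each following frame."""
    tracklet = [detections[start_frame][start_index]]
    for t in range(start_frame + 1, min(start_frame + length, len(detections))):
        candidates = detections[t]
        if not candidates:
            break
        best = max(candidates, key=lambda box: iou(tracklet[-1], box))
        if iou(tracklet[-1], best) <= 0.0:
            break  # nothing overlaps: end the tracklet early
        tracklet.append(best)
    return tracklet
```

Because the strategy never revisits earlier choices, a single wrong link propagates to the rest of the tracklet; keeping tracklets short (15 frames here) limits that effect.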

Dense trajectory. Improved dense trajectories have been used to
efficiently represent videos with camera motions [28]. In this paper,
we extract such improved trajectory features (typically 15 frames). If
the major part of a trajectory, e.g., more than 8 out of 15 frames,
does not coincide with a tracklet, we treat it as a trajectory in the
background. In this paper, we remove background trajectories and only
keep the trajectories in the foreground.

Hankelet. Following [16], we construct one Hankelet (a 16 x 8 Hankel
matrix) for each trajectory. The Hankelets feature for a candidate is
the combination of the Hankelets for all the trajectories in this
candidate's bounding box.
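
As an illustration of where the 16 x 8 size can come from, the sketch below builds a block-Hankel matrix from a 2D trajectory: with 8 block rows of 2D points, a trajectory of 15 samples yields 2 * 8 = 16 rows and 15 - 8 + 1 = 8 columns. Stacking raw point coordinates (rather than, say, velocities) and the function name `hankel_from_trajectory` are assumptions of this example, not details confirmed by the text.

```python
import numpy as np


def hankel_from_trajectory(points, block_rows: int = 8) -> np.ndarray:
    """Build a block-Hankel matrix from an (L, 2) trajectory.

    Block (i, j) of the result holds the 2D sample with index i + j, so the
    matrix has 2 * block_rows rows and L - block_rows + 1 columns.
    """
    points = np.asarray(points, dtype=float)
    length = points.shape[0]
    cols = length - block_rows + 1
    if cols < 1:
        raise ValueError("trajectory is too short for the requested number of block rows")
    # Block row i contains samples i, i+1, ..., i+cols-1 as 2D column vectors.
    return np.vstack([points[i:i + cols].T for i in range(block_rows)])
```

The defining Hankel property is that every anti-diagonal of blocks repeats the same sample, which is what ties the matrix to the dynamics of the trajectory.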

MPH. The MPH features for a candidate's trajectories consist of 5
histograms, corresponding to the 5 motion directions as used in the
frame-based features (see Fig. 5(b)). For each direction, the histogram
takes each frame as a bin and the histogram value corresponds to the
total trajectory magnitude along this motion direction in this frame.
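
A possible reading of this construction is sketched below: every per-frame trajectory displacement is assigned to one of the 5 folded directions and its magnitude is added to that direction's bin for the frame. Assigning each displacement wholly to its nearest direction (instead of projecting it), the bin order S, SE, E, NE, N, and the upward-positive `dy` convention are assumptions of this example.

```python
import numpy as np


def movement_pattern_histograms(displacements) -> np.ndarray:
    """Compute a (5, num_frames) array: one per-frame histogram per folded direction.

    displacements: array of shape (num_trajectories, num_frames, 2) holding the
    (dx, dy) step of every trajectory in every frame, dy positive upward.
    Row order of the result: S, SE, E, NE, N.
    """
    d = np.asarray(displacements, dtype=float)
    dx, dy = np.abs(d[..., 0]), d[..., 1]  # fold left/right as in the frame-based HOF
    magnitude = np.hypot(dx, dy)
    direction = np.rint((np.degrees(np.arctan2(dy, dx)) + 90.0) / 45.0).astype(int)
    mph = np.zeros((5, d.shape[1]))
    for k in range(5):
        # Total magnitude of all trajectories moving in direction k, frame by frame.
        mph[k] = np.where(direction == k, magnitude, 0.0).sum(axis=0)
    return mph
```

Each row is a time series over the frames of the tracklet, so two candidates can be compared direction by direction with an L1 distance.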

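The difference between two Hankelets $K_r$ and $K_s$ is defined as
$d(K_r, K_s) = 2 - \|K_r K_r^T + K_s K_s^T\|_F$ [16]. As mentioned
above, each candidate corresponds to a set of Hankelets, one for each
trajectory. In this paper, we define the Hankelet-based difference
between two candidates as the average over all Hankelet pairs across
these two candidates. By using the L1 distance for the MPH features, we
define the trajectory-based energy term as

    $E_2^{\mathrm{trj}}(h^n_t, h^m_t) = \frac{1}{N_h} \sum_{r \in h^n_t,\, s \in h^m_t} d(K_r, K_s) + \sum_{d=1}^{5} \|M^d_{h^n_t} - M^d_{h^m_t}\|_1$,    (6)

where $N_h$ denotes the number of all different Hankelet pairs across
the two candidates and $M^d_{h^n_t}$ indicates the $d$-th histogram (in
total 5 directions) in the MPH features of candidate $h^n_t$.

Finally, we define the inter-video energy as

    $E_2(h^n_t, h^m_t) = E_2^{\mathrm{frm}}(h^n_t, h^m_t) + E_2^{\mathrm{trj}}(h^n_t, h^m_t)$,

i.e., the sum of the frame-based term in Eq. (5) and the
trajectory-based term in Eq. (6).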
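
The Hankelet dissimilarity and its average over all Hankelet pairs of two candidates can be sketched as below. The example assumes that each Hankel matrix is first scaled so that $\|K K^T\|_F = 1$, which makes the dissimilarity of a Hankelet with itself zero; this normalization and the function names are assumptions of the illustration.

```python
import numpy as np


def normalize_hankel(h) -> np.ndarray:
    """Scale a nonzero Hankel matrix so that the Frobenius norm of K K^T equals 1."""
    h = np.asarray(h, dtype=float)
    return h / np.sqrt(np.linalg.norm(h @ h.T, "fro"))


def hankelet_dissimilarity(k_r: np.ndarray, k_s: np.ndarray) -> float:
    """d(K_r, K_s) = 2 - ||K_r K_r^T + K_s K_s^T||_F for normalized Hankelets."""
    return float(2.0 - np.linalg.norm(k_r @ k_r.T + k_s @ k_s.T, "fro"))


def candidate_hankelet_difference(hankelets_a, hankelets_b) -> float:
    """Average dissimilarity over all Hankelet pairs across two candidates."""
    return float(np.mean([hankelet_dissimilarity(r, s) for r in hankelets_a for s in hankelets_b]))
```

Under this normalization the dissimilarity ranges from 0 (identical column spaces and scaling) to 2 - sqrt(2) for Hankelets whose column spaces are orthogonal.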

3.3. Identifying the frames without CIP 

One problem of the CRF model defined above is its assumption that there
is always a CIP in each frame. This may not be true in practice. For
example, the CIP's egocentric video usually cannot capture the CIP
himself/herself. Similar issues may occur when the CIP is occluded in
some of the frames. To handle this issue, we add an idle state for each
node (frame). Let
$\mathcal{A} = \{A^n_t \mid A^n_t = h^n_t \cup \{\varnothing\}; n = 1, \ldots, N; t = 1, \ldots, T\}$
denote the state set which includes the idle states $\varnothing$. The
energy function is redefined as

    $E(\mathcal{A}) = E(H) + \sum_{n,t,\tau \neq t} \bar{E}_1(A^n_t, A^n_\tau) + \sum_{t,n,m \neq n} \bar{E}_2(A^n_t, A^m_t)$,    (7)

where $\bar{E}_1$ and $\bar{E}_2$ denote the intra-video and
inter-video energies that involve idle states, respectively. In this
paper, we simply define them using the average intra-video energy
and inter-video energy over the candidate pairs:

    $\bar{E}_1 = \frac{1}{W_1} \sum_{n,t,\tau \neq t} E_1(h^n_t, h^n_\tau), \qquad \bar{E}_2 = \frac{1}{W_2} \sum_{t,n,m \neq n} E_2(h^n_t, h^m_t)$,    (8)

where $\bar{E}_1$ and $\bar{E}_2$ are the intra-video and inter-video
energies assigned to pairs that involve an idle state, the sums run
over all candidate pairs on the indexed frame pairs, and $W_1$ and
$W_2$ denote the number of all different candidate pairs used in
calculating the average intra-video and inter-video energies,
respectively. As illustrated in Fig. 6, the average energy is located
between the minimal energy for a pair of CIPs and the energies between
a pair of candidates with at least one non-CIP. This facilitates the
selection of the idle state in a frame when the CIP is missing in this
frame.

Figure 6. An illustration of using average energy over all candidate
pairs as the energy terms for idle states (axis label: energy
increases). The average energy is located between the minimal energy
for a pair of CIPs and the energies between a pair of candidates with
at least one non-CIP.

Minimizing Eq. (7) is a discrete energy minimization problem [9, 15].
In this paper, we use the TRW-S algorithm [15] to solve for an
approximately optimal solution.

4. Experimental Results

4.1. Data collection

We collect three sets of temporally synchronized videos taken by
multiple wearable cameras. These three sets of videos, denoted as V1,
V2 and V3 respectively, are taken in different scenes, including both
indoor and outdoor settings. For each video set, there are 6 persons
who are both performers and camera wearers and therefore generate 6
videos. Each person wears a GoPro camera over the head. We arrange the
video recording in a way that the 6 performers alternately play as the
CIP by performing different actions. All 6 persons wear white shirts
and bluish jeans, thus sharing very similar appearances. We manually
label the CIP by a bounding box in each frame using the video
annotation tool provided in [27]. In total, we collected 24,000 frames
(16 minutes), 25,000 frames (16 minutes 40 seconds) and 20,000 frames
(13 minutes 20 seconds) for the three video sets V1, V2, and V3,
respectively.

4.2. Results

We first show an example to illustrate the effectiveness of the
proposed motion features for identifying the same person from different
videos that are temporally synchronized. As shown in Fig. 7(a), blue
bounding boxes indicate the detected CIP candidates and red points
indicate the improved dense trajectories for each candidate. The MPH
features, the color histograms in Lab color channels, and the HOF
features are visualized below the corresponding frames. In Fig. 7(b),
confusion matrices between different candidates are given when using
different features - each element in the confusion matrices indicates
the energy in matching one candidate from frame F1 and a candidate from
frame F2. Note that C1 and D1 are the same person, and C2 and D2 are
also the same person. Bold font in these matrices indicates the
matching energy (i.e., feature difference) of the same person across
these two frames, where smaller is better. We can see that when using
the four motion features, these bold-font elements are usually the
smallest elements in the respective confusion matrices. However, when
using the color features, the bold-font elements are not the smallest
in their respective confusion matrix. This shows that the motion
features can be more effective than the color features in person
identification when the involved people share a very similar
appearance.
Figure 7. An example to illustrate the effectiveness of the proposed
motion features.


We then evaluate the proposed algorithm on the three collected video
sets. For each detected CIP, denoted by $C$, if there is a ground-truth
box $G$ with an overlap $O = |C \cap G| / |C \cup G|$ larger than 0.5,
we count this detected CIP $C$ as a true positive. In this way we can
calculate the precision, recall, and the F-score
$= 2 \cdot \mathrm{precision} \cdot \mathrm{recall} / (\mathrm{precision} + \mathrm{recall})$.
Table 1 shows the quantitative performance of the proposed algorithm
and a state-of-the-art video co-segmentation method [31], as well as
the variants of the proposed algorithm using different features. For
the comparison method [31], instead of using the object proposal
result, we directly feed the bounding boxes of the detected candidates
to its pipeline. "Frame based" and "Trajectory based" are the variants
of the proposed method using only the frame-based features and the
trajectory-based features, respectively. "Color based" is another
variant of the proposed method using only the color features of Lab
histograms instead of any motion features. We can see that the
comparison method [31] shows a similar performance to "Color based" and
neither of them performs as well as the proposed algorithm. To
demonstrate the usefulness of the location-change penalty term in
Eq. (4), we also report the results of the proposed algorithm without
this location-change penalty term, indicated by "w/o location penalty"
in Table 1.
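
The evaluation protocol can be sketched as follows. The overlap is intersection over union with a 0.5 threshold, and the F-score is the harmonic mean of precision and recall. How frames without a detection or without a ground-truth CIP enter the counts is not spelled out in the text; counting precision over frames with a detection and recall over frames with a ground-truth CIP is an assumption of this example, as are the function names.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total > 0 else 0.0


def evaluate(
    detections: Sequence[Optional[Box]],
    ground_truth: Sequence[Optional[Box]],
    threshold: float = 0.5,
) -> Tuple[float, float, float]:
    """Per-frame evaluation; None marks a frame without a detection or without a CIP."""
    true_pos = sum(
        1
        for d, g in zip(detections, ground_truth)
        if d is not None and g is not None and iou(d, g) > threshold
    )
    num_det = sum(d is not None for d in detections)
    num_gt = sum(g is not None for g in ground_truth)
    precision = true_pos / num_det if num_det else 0.0
    recall = true_pos / num_gt if num_gt else 0.0
    return precision, recall, f_score(precision, recall)
```

As a consistency check, `f_score(0.5598, 0.6036)` evaluates to about 0.5809, which agrees with the F-score reported for the proposed method on V1.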


Table 1. The performance of the proposed algorithm and its variants,
and a comparison video co-segmentation method [31].

Models                 Sets   Precision   Recall   F-score
Color based            V1                 0.4405   0.4317
                       V2     0.4401      0.4259   0.4329
                       V3     0.3812      0.4270   0.4028
Frame based            V1     0.4667      0.5011   0.4833
                       V2     0.4481      0.5066   0.4756
                       V3     0.4089      0.4401   0.4239
Trajectory based       V1     0.5101      0.5523   0.5304
                       V2     0.4898      0.5396   0.5135
                       V3     0.4611      0.5122   0.4853
w/o location penalty   V1     0.4891      0.5207   0.5044
                       V2     0.4622      0.4758   0.4689
                       V3     0.4532      0.5107   0.4802
Proposed               V1     0.5598      0.6036   0.5809
                       V2     0.5287      0.5682   0.5477
                       V3     0.5027      0.5984   0.5464



Note that the performance of the proposed algorithm is highly dependent
on the accuracy of the human detection that is used for candidate
detection. If a CIP is present but not detected as a candidate, the
proposed algorithm will surely fail to detect the CIP. We therefore
also conduct an experiment to evaluate the proposed CIP detection
algorithm only on the frames where the underlying CIP is among the
detected candidates. This result shows the performance of the proposed
CIP detection when the errors from human detection are excluded.
Specifically, if no detected candidate shows a larger-than-0.5 overlap
(intersection divided by union) with the ground-truth CIP on a frame,
we exclude the CIP detection on this frame from the performance
evaluation. Table 2 shows the results before and after excluding such
frames from the evaluation.

Table 2. The performance of the proposed method before and after
excluding the frames where the CIP is present but not among the
detected candidates.

Models   Sets   Precision   Recall   F-score
Before   V1     0.5598      0.6036   0.5809
         V2     0.5287      0.5682   0.5477
         V3     0.5027      0.5984   0.5464
After    V1     0.6134      0.6591   0.6354
         V2     0.5960      0.6011   0.5985
         V3     0.5789      0.6603   0.6169



Figures 8 and 9 show the CIP detection results on sample frames from V1
and V3, respectively. Blue, red and green boxes indicate the detected
candidates, the detected CIP and the ground truth, respectively. Frames
with a solid red square on the top-left corner indicate that no CIP is
detected by our algorithm, e.g., they are drawn from the CIP's
egocentric video or the CIP is occluded in these frames. Frames with a
solid blue square on the top-left corner indicate that no candidate is
detected in these frames. As shown in Fig. 8, the proposed algorithm
can detect the CIP even if the CIP shows similar appearance to other
people in the same scene. From the top four rows of Video 3, the bottom
row of Video 1, and the second row of Video 2 in Fig. 8, we can see
that the proposed algorithm can handle CIP-missing cases, e.g., on the
frames drawn from the CIP's egocentric video, by introducing the idle
states. The second row of Video 4 in Fig. 8 shows a failure case, which
is caused by the partial occlusion of the CIP. The top two rows of
Video 3 in Fig. 9 show another failure case where the CIP is not
detected because it is not among the detected candidates.

The most time-consuming steps in the proposed algorithm are the
extraction of the raw features, such as the dense trajectories and
optical flow. The candidate detection is also time consuming. The major
components of the algorithm, including the motion-feature generation,
the CRF construction and the CRF optimization, take an average time of
20 seconds (dependent on the number of candidates detected in a video
clip) on a laptop with an Intel i7-2620M CPU and 4GB RAM, where each
CRF is constructed for a 100-frame window over 6 synchronized videos.
Therefore, in total 600 frames are modeled by a CRF in our experiments.

Figure 8. The CIP detection on sample frames from V1 (columns: Video 1
to Video 6). Blue, red and green boxes indicate the detected
candidates, the detected CIP and the ground truth, respectively. Frames
with a solid red square on the top-left corner indicate that no CIP is
detected by our algorithm, e.g., they are drawn from the CIP's
egocentric video or the CIP is occluded in these frames. Frames with a
solid blue square on the top-left corner indicate that no candidate is
detected in these frames. Best viewed in color.

Figure 9. The CIP detection on sample frames from V3 (columns: Video 1
to Video 6). See the caption of Fig. 8 for the meaning of
different-color boxes. Best viewed in color.


5. Conclusions 

In this paper, we developed a new algorithm to detect co-interest
persons (CIPs) from multiple, temporally synchronized videos that are
taken by multiple wearable cameras from different view angles. In
particular, the proposed algorithm extracts and matches the motion
patterns across these videos for CIP detection and can handle the case
where the CIP shares a very similar appearance with other nearby
non-CIP persons. The proposed algorithm is based on a CRF model which
integrates both intra-video and inter-video properties. In the
experiments, we collected three video sets, each of which contains six
13+ minute GoPro videos that are temporally synchronized for
performance evaluation. The results show that the proposed algorithm
outperforms a state-of-the-art video co-segmentation method and other
color-based methods.